The classification of observations into groups requires some methods for computing the distance or the (dis)similarity between each pair of observations. The result of this computation is known as a dissimilarity or distance matrix.
The classical distance measures are the Euclidean and Manhattan distances.
Other dissimilarity measures exist, such as correlation-based distances, which are widely used for gene expression data analyses.
There are multiple types of correlation methods, such as the Pearson, Spearman and Kendall correlations.
The standardization of data is widely used in the context of gene expression data analysis before clustering. We might also want to scale the data when the means and/or standard deviations of the variables are very different.
The R base function scale() can be used to standardize the data. It takes a numeric matrix as input and performs the scaling on the columns:
data("USArrests") # Loading the data set
set.seed(123)
ss <- sample(1:50, 15)
df <- USArrests[ss, ]
df.scaled <- scale(df)
#View(df.scaled)

For each column, scale() calculates the mean and standard deviation, then "scales" each element by subtracting the column mean and dividing by the column standard deviation.
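As a sketch of what scale() does internally, the same result can be reproduced manually in base R with apply() (reusing the same random subset as above):

```r
# Reproduce scale() by hand: for each column, subtract the mean
# and divide by the standard deviation
data("USArrests")
set.seed(123)
df <- USArrests[sample(1:50, 15), ]
manual <- apply(df, 2, function(x) (x - mean(x)) / sd(x))
# The values match those returned by scale()
all.equal(manual, scale(df), check.attributes = FALSE)
```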
When computing distances we can use several functions:

- dist(): the R base function; it only accepts numeric data as input.
- get_dist() [factoextra package]: similar to the base function, but it also supports correlation-based distance measures.
- daisy() [cluster package]: able to handle other types of variables (e.g. nominal, ordinal, binary); in that case, the Gower coefficient is used as the metric.

As an example, we can compute the Euclidean distance:
# To compute the Euclidean distance, use the R base dist() function, as follows:
dist.eucl <- dist(df.scaled, method = "euclidean")
# Reformat as a matrix
# Subset the first 3 columns and rows and Round the values
round(as.matrix(dist.eucl)[1:3, 1:3], 1)
## New Mexico Iowa Indiana
## New Mexico 0.0 4.1 2.5
## Iowa 4.1 0.0 1.8
## Indiana 2.5 1.8 0.0
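The correlation-based distances mentioned earlier can be sketched in base R as 1 minus the correlation between observation profiles (factoextra's get_dist(df.scaled, method = "pearson") provides the same kind of measure). Recomputed here so the snippet is self-contained:

```r
# Pearson correlation distance between observations (rows):
# d(a, b) = 1 - cor(a, b); cor() works on columns, hence t()
data("USArrests")
set.seed(123)
df.scaled <- scale(USArrests[sample(1:50, 15), ])
dist.cor <- as.dist(1 - cor(t(df.scaled), method = "pearson"))
round(as.matrix(dist.cor)[1:3, 1:3], 2)
```

Correlation distances lie between 0 (perfectly correlated profiles) and 2 (perfectly anti-correlated profiles).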
To visualize a distance matrix we can use the function fviz_dist() [factoextra package].
Example:
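Assuming dist.eucl is the Euclidean distance computed above (recomputed here so the snippet is self-contained), the call would look like:

```r
library(factoextra)
data("USArrests")
set.seed(123)
df.scaled <- scale(USArrests[sample(1:50, 15), ])
dist.eucl <- dist(df.scaled, method = "euclidean")
# Heat map of the distance matrix; observations are reordered so that
# similar ones are displayed close to each other
fviz_dist(dist.eucl,
          gradient = list(low = "#FC4E07", mid = "white", high = "#00AFBB"))
```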
In the resulting plot, red indicates high similarity and blue indicates low similarity between observations; similar observations are displayed in consecutive order.
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms for partitioning a given data set into a set of k clusters (groups), where k represents the number of groups pre-specified by the analyst. Objects within the same cluster are as similar as possible (high intra-class similarity); by contrast, objects from different clusters are as dissimilar as possible (low inter-class similarity). Each cluster is represented by its centroid, which corresponds to the mean of the points assigned to the cluster and may also be called the center of the cluster.
The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:

W(C_k) = sum_{x_i in C_k} (x_i - mu_k)^2, and tot.withinss = sum_{k=1}^{K} W(C_k),

where x_i is an observation belonging to cluster C_k and mu_k is the mean (centroid) of the points in C_k. The total within-cluster sum of squares should be as small as possible in order to obtain a good clustering.
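This definition can be checked numerically in base R (a sketch; km below is an ordinary kmeans() fit on the scaled USArrests data used later in this chapter):

```r
data("USArrests")
df <- scale(USArrests)
set.seed(123)
km <- kmeans(df, centers = 4, nstart = 25)
# Recompute the total within-cluster sum of squares by hand:
# for each cluster, sum the squared distances to its centroid
tot <- sum(sapply(1:4, function(k) {
  pts <- df[km$cluster == k, , drop = FALSE]
  ctr <- km$centers[k, ]
  sum(sweep(pts, 2, ctr)^2)
}))
# Matches the tot.withinss component returned by kmeans()
all.equal(tot, km$tot.withinss)
```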
The algorithm follows these steps:

1. Specify the number of clusters k.
2. Select k observations at random as the initial centroids.
3. Assign each observation to its closest centroid, based on the Euclidean distance.
4. Recompute the centroid of each cluster as the mean of its assigned points.
5. Iterate steps 3-4 until the cluster assignments stop changing or the maximum number of iterations is reached.
The k-means algorithm should not depend on any arbitrary variable unit, so the data should be scaled using the R function scale():
data("USArrests") # Loading the data set
df <- scale(USArrests) # Scaling the data
# View the first 3 rows of the data
head(df, n = 3)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
K-means clustering requires the user to specify the number of clusters to be generated. But how can the user know the right number of expected clusters k?
A simple solution is to compute k-means clustering for different values of k and to plot the wss (total within-cluster sum of squares) against the number of clusters. A good plot, with an appropriate number of clusters, resembles a bent knee: the bend indicates the right k.
To estimate the optimal number of clusters, the R function fviz_nbclust() [factoextra package] can be used:
library(factoextra)
fviz_nbclust(df, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)

This plot reflects the variance within the clusters, which decreases as k increases; after k = 4, additional clusters add little value.
It is always recommended to use the set.seed() function to fix R's random number generator, because k-means clustering starts from k randomly selected centroids.
The following code performs k-means clustering with k = 4. R will try 25 different random starting assignments and then select the best result (the one with the lowest within-cluster variation):
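A call matching this description (assuming df is the scaled USArrests data prepared above; repeated here so the snippet is self-contained) would be:

```r
data("USArrests")
df <- scale(USArrests)
set.seed(123)
# k = 4 clusters, 25 random starts; the best solution is kept
km.res <- kmeans(df, 4, nstart = 25)
# Print the results
print(km.res)
```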
## K-means clustering with 4 clusters of sizes 8, 13, 16, 13
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 1.4118898 0.8743346 -0.8145211 0.01927104
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 4 0.6950701 1.0394414 0.7226370 1.27693964
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 4 4 1 4
## Colorado Connecticut Delaware Florida Georgia
## 4 3 3 4 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 2 4 3 2
## Kansas Kentucky Louisiana Maine Maryland
## 3 2 1 2 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 4 2 1 4
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 4 2 3
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 1 2 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 4 3 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 2 2 3
##
## Within cluster sum of squares by cluster:
## [1] 8.316061 11.952463 16.212213 19.922437
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The printed output shows the cluster means (centers), the clustering vector, the within-cluster sum of squares by cluster, and the available components.
To compute the mean of each variable by cluster, the following function can be used:
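The means shown below are consistent with base R's aggregate() applied to the original (unscaled) data; a self-contained sketch that recomputes the clustering from the previous step:

```r
data("USArrests")
set.seed(123)
km.res <- kmeans(scale(USArrests), 4, nstart = 25)
# Mean of each variable by cluster, on the original scale
aggregate(USArrests, by = list(cluster = km.res$cluster), mean)
```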
## cluster Murder Assault UrbanPop Rape
## 1 1 13.93750 243.62500 53.75000 21.41250
## 2 2 3.60000 78.53846 52.07692 12.17692
## 3 3 5.65625 138.87500 73.87500 18.78125
## 4 4 10.81538 257.38462 76.00000 33.19231
To add the point classifications to the original data:
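A self-contained sketch of the classification step (recomputing the clustering as above) would be:

```r
data("USArrests")
set.seed(123)
km.res <- kmeans(scale(USArrests), 4, nstart = 25)
# Attach the cluster assignment to the original data
dd <- cbind(USArrests, cluster = km.res$cluster)
head(dd)
```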
## Murder Assault UrbanPop Rape cluster
## Alabama 13.2 236 58 21.2 1
## Alaska 10.0 263 48 44.5 4
## Arizona 8.1 294 80 31.0 4
## Arkansas 8.8 190 50 19.5 1
## California 9.0 276 91 40.6 4
## Colorado 7.9 204 78 38.7 4
The kmeans() function returns a list of components, including cluster (the assignments), centers (the cluster means), totss, withinss, tot.withinss, betweenss and size, which can be accessed individually:
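The outputs below are consistent with accessing the individual components, e.g. (recomputed here so the snippet is self-contained):

```r
data("USArrests")
set.seed(123)
km.res <- kmeans(scale(USArrests), 4, nstart = 25)
km.res$cluster   # cluster assignment of each observation
km.res$size      # number of observations in each cluster
km.res$centers   # cluster means (centroids)
```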
## Alabama Alaska Arizona Arkansas California
## 1 4 4 1 4
## Colorado Connecticut Delaware Florida Georgia
## 4 3 3 4 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 2 4 3 2
## Kansas Kentucky Louisiana Maine Maryland
## 3 2 1 2 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 4 2 1 4
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 4 2 3
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 1 2 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 4 3 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 2 2 3
## [1] 8 13 16 13
## Murder Assault UrbanPop Rape
## 1 1.4118898 0.8743346 -0.8145211 0.01927104
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 4 0.6950701 1.0394414 0.7226370 1.27693964
If we have a multi-dimensional data set, a solution is to perform Principal Component Analysis (PCA) and to plot the data points according to the coordinates of the first two principal components.
fviz_cluster() can be used to easily visualize k-means clusters. As arguments, it takes the k-means results and the original data. In the resulting plot, observations are represented by points, using principal components if the number of variables is greater than 2. Around each cluster, it is possible to draw a concentration ellipse:
fviz_cluster(km.res, data = df,
palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
ellipse.type = "euclid", # Concentration ellipse
star.plot = TRUE, # Add segments from centroids to items
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_minimal()
)

Advantages:

- Simple, fast and computationally efficient, even for fairly large data sets.

Weaknesses:

- The number of clusters k must be specified in advance.
- The final results depend on the initial random selection of centroids.
- It is sensitive to outliers, since cluster centers are means.
A good alternative to k-means is PAM, which uses medoids.
K-means clustering is used to arrange observations into k groups. Each group is represented by the mean value of points in the group / cluster centroid.
The R function kmeans() is used to compute the k-means algorithm.
After computing kmeans, fviz_cluster() is used to visualize the results.
The k-medoids algorithm is related to the k-means clustering algorithm: both partition a data set into k groups (clusters). In k-medoids, however, each cluster is represented by one of the data points in the cluster; these points are called cluster medoids.
The term medoid refers to the object within a cluster for which the average dissimilarity between it and all the other members of the cluster is minimal. These objects are considered representative examples of the members of that cluster. This differs from k-means clustering, where the center of a given cluster is calculated as the mean value of all the data points in the cluster.
As in the k-means algorithm, the number of clusters to be generated is given as an input from the user. The silhouette method is a good way to determine the optimal number of clusters.
Compared to k-means, the algorithm is less sensitive to noise and outliers, because it uses medoids rather than means.
The most common k-medoids clustering method is PAM (Partitioning Around Medoids).
The PAM algorithm follows these steps:

1. Select k objects to become the initial medoids.
2. Assign every remaining object to its closest medoid.
3. For each cluster, check whether any object of the cluster decreases the average dissimilarity if it becomes the medoid; if so, select it as the new medoid.
4. Repeat steps 2-3 until the medoids no longer change.
To compute the matrix of dissimilarity, the algorithm can use two metrics: the Euclidean distance and the Manhattan distance.
The functions pam() [cluster package] and pamk() [fpc package] can be used to compute PAM. pamk() does not require the user to decide the number of clusters k.
pam() function format:
pam(x, k, metric = "euclidean", stand = FALSE)
The average silhouette method is used to estimate the optimal number of clusters. The idea is to run the PAM algorithm for different values of k and to compute the average silhouette of the resulting clusters; the average silhouette measures the quality of the clustering. The optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate values for k.
fviz_nbclust() is a function that estimates the optimal number of clusters:
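For PAM, the call might look like (a sketch, assuming the factoextra and cluster packages are available):

```r
library(cluster)
library(factoextra)
data("USArrests")
df <- scale(USArrests)
# Average silhouette width for a range of k; the peak suggests k
fviz_nbclust(df, pam, method = "silhouette") +
  theme_classic()
```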
It can be seen from the plot above that the optimal number of clusters is 2.
The following computes the PAM algorithm with k = 2:
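The output below corresponds to printing a plain pam() fit on the scaled data, e.g.:

```r
library(cluster)
data("USArrests")
df <- scale(USArrests)
# PAM with k = 2 clusters (deterministic, no seed needed)
pam.res <- pam(df, 2)
print(pam.res)
```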
## Medoids:
## ID Murder Assault UrbanPop Rape
## New Mexico 31 0.8292944 1.3708088 0.3081225 1.1603196
## Nebraska 27 -0.8008247 -0.8250772 -0.2445636 -0.5052109
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 2 2 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 2 1 2 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 2 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 2 2 2
## Objective function:
## build swap
## 1.441358 1.368969
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
The output displays the cluster medoids, the clustering vector, the objective function, and the available components.
To add the point classifications to the original data:
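A self-contained sketch of this step, which also accesses the medoids and clustering components shown below:

```r
library(cluster)
data("USArrests")
df <- scale(USArrests)
pam.res <- pam(df, 2)
# Attach the cluster assignment to the original data
dd <- cbind(USArrests, cluster = pam.res$clustering)
head(dd, 3)
# Medoids and clustering vector are also available directly:
pam.res$medoids
head(pam.res$clustering, 6)
```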
## Murder Assault UrbanPop Rape cluster
## Alabama 13.2 236 58 21.2 1
## Alaska 10.0 263 48 44.5 1
## Arizona 8.1 294 80 31.0 1
## Murder Assault UrbanPop Rape
## New Mexico 0.8292944 1.3708088 0.3081225 1.1603196
## Nebraska -0.8008247 -0.8250772 -0.2445636 -0.5052109
## Alabama Alaska Arizona Arkansas California Colorado
## 1 1 1 2 1 1
fviz_cluster() is used to visualize the partitioning results: it draws a scatter plot of data points colored by cluster number. If the data contain more than 2 variables, Principal Component Analysis (PCA) is used to reduce the dimensionality of the data, and the first two principal components are used to plot the data.
fviz_cluster(pam.res,
palette = c("#00AFBB", "#FC4E07"), # color palette
ellipse.type = "t", # Concentration ellipse
repel = TRUE, # Avoid label overplotting (slow)
ggtheme = theme_classic()
)

A good alternative to the k-means algorithm is k-medoids, where each cluster is represented by a selected object within the cluster. These objects correspond to the most centrally located points within their clusters.
The PAM algorithm requires the user to know the data and to indicate the appropriate number of clusters to be produced. pam(x, k) can be used to compute the algorithm, where x represents the data and k the number of clusters to be generated. For large data sets, pam() may need too much memory or computation time; an alternative is clara().
After the clustering is performed, fviz_cluster() can be used to visualize the results.
CLARA (Clustering Large Applications) is an extension of k-medoids methods designed to deal with data containing a large number of objects. The CLARA method considers small samples of the data and generates an optimal set of medoids using the PAM algorithm.
The algorithm is as follows:

1. Randomly split the data set into multiple subsets of fixed size (sampsize).
2. Apply the PAM algorithm on each subset to choose k medoids, then assign each observation of the entire data set to its closest medoid.
3. Calculate the mean (or the sum) of the dissimilarities between the observations and their closest medoid as a measure of the goodness of the clustering.
4. Retain the sub-data set for which this measure is minimal.
The function clara() [cluster package] can be used to compute CLARA.
clara(x, k, metric = "euclidean", stand = FALSE, samples = 5, sampsize = min(n, 40 + 2 * k), trace = 0, medoids.x = TRUE, keep.data = medoids.x, rngR = FALSE)
To estimate the optimal number of clusters in the data, it is possible to use the average silhouette method, as described for PAM.
To visualize the partitioning results, we will use the function fviz_cluster(); it draws a scatter plot of data points colored by cluster number:
# Compute CLARA
clara.res <- clara(df, 2, samples = 50, pamLike = TRUE)
fviz_cluster(clara.res,
palette = c("#00AFBB", "#FC4E07"), # color palette
ellipse.type = "t", # Concentration ellipse
geom = "point", pointsize = 1,
ggtheme = theme_classic()
)

Agglomerative clustering starts by treating each object as a singleton cluster; pairs of clusters are then successively merged until one cluster containing all objects remains. The resulting tree-based representation is called a dendrogram.
The steps are as follows:

1. Prepare the data and compute the dissimilarity matrix.
2. Treat each object as a singleton cluster.
3. Repeatedly merge the two closest clusters, using a linkage method (e.g. Ward's method) to measure the distance between clusters.
4. Stop when all objects belong to a single cluster; the dendrogram can then be cut at the desired level to obtain the final clusters.
# Load the data
data("USArrests")
# Standardize the data
df <- scale(USArrests)
# Show the first 6 rows
head(df, n = 6)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
Dendrograms correspond to the graphical representation of the hierarchical tree generated by the function hclust().
# Load data
data(USArrests)
# Compute distances and hierarchical clustering
dd <- dist(scale(USArrests), method = "euclidean")
hc <- hclust(dd, method = "ward.D2")
plot(hc)

In the dendrogram displayed above, each leaf corresponds to one object. As we move up the tree, objects that are similar to each other are combined into branches, which are themselves fused at a higher height.
The height at which two objects are first merged into one cluster is known as the cophenetic distance between them.
The R base function cophenetic() can be used to compute the cophenetic distances for hierarchical clustering.
# Compute the dissimilarity matrix
# df = the standardized data
res.dist <- dist(df, method = "euclidean")
res.hc <- hclust(d = res.dist, method = "ward.D2")
# Compute cophentic distance
res.coph <- cophenetic(res.hc)
# Correlation between cophenetic distance and
# the original distance
cor(res.dist, res.coph)
## [1] 0.6975266
One of the problems with hierarchical clustering is that it does not tell us how many clusters there are, or where to cut the dendrogram to form clusters.
We can use the function cutree() to cut a tree by specifying either the height or the number of groups desired.
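For example (reusing res.hc from above, recomputed here so the snippet is self-contained):

```r
data("USArrests")
df <- scale(USArrests)
res.hc <- hclust(dist(df, method = "euclidean"), method = "ward.D2")
# Cut the Ward dendrogram into 4 groups
grp <- cutree(res.hc, k = 4)
head(grp, n = 4)
# Number of members in each cluster
table(grp)
```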
## Alabama Alaska Arizona Arkansas
## 1 2 2 3
The result can be visualized in a scatter plot with fviz_cluster() or, as below, by coloring the dendrogram by group with fviz_dend():
fviz_dend(res.hc, k = 4, # Cut in four groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE # Add rectangle around groups
)

K-means is one of the most popular clustering algorithms, but it has some limitations: the number of clusters must be specified in advance, and the initial centroids are selected randomly. As a result, the final k-means clustering is tightly coupled to the initial random selection of cluster centers.
The hierarchical k-means algorithm follows these steps:

1. Compute hierarchical clustering and cut the tree into k clusters.
2. Compute the center (i.e. the mean) of each cluster.
3. Run k-means with these cluster centers as the initial centroids.
hkmeans() gives an easy solution to compute the algorithm:
# Compute hierarchical k-means clustering
library(factoextra)
res.hk <-hkmeans(df, 4)
# Elements returned by hkmeans()
names(res.hk)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
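The printed output below corresponds to printing the fit (recomputed here so the snippet is self-contained):

```r
library(factoextra)
data("USArrests")
df <- scale(USArrests)
# Hierarchical k-means with k = 4 (deterministic initialization)
res.hk <- hkmeans(df, 4)
print(res.hk)
```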
## Hierarchical K-means clustering with 4 clusters of sizes 8, 13, 16, 13
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 1.4118898 0.8743346 -0.8145211 0.01927104
## 2 0.6950701 1.0394414 0.7226370 1.27693964
## 3 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 4 -0.9615407 -1.1066010 -0.9301069 -0.96676331
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 2 2 1 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 4 2 3 4
## Kansas Kentucky Louisiana Maine Maryland
## 3 4 1 4 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 4 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 4 4 2 4 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 4 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 4 1 2 3 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 4 4 3
##
## Within cluster sum of squares by cluster:
## [1] 8.316061 19.922437 16.212213 11.952463
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
# Visualize the tree
fviz_dend(res.hk, cex = 0.6, palette = "jco",
rect = TRUE, rect_border = "jco", rect_fill = TRUE)

# Visualize the hkmeans final clusters
fviz_cluster(res.hk, palette = "jco", repel = TRUE, ggtheme = theme_classic())

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise; it is used to identify clusters of any shape in a data set containing noise and outliers.
Clusters are dense regions in the data space, separated by regions of lower point density. The idea is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such clusters can be easily identified in the image below.
For spherical clusters, partitioning methods (k-means, PAM clustering) or hierarchical clustering are suitable. They work well for compact, well-separated clusters, but they are affected by noise and outliers in the data.
However, real-life data brings up several challenges, such as clusters of arbitrary (non-spherical) shape and the presence of noise and outliers.
The plot contains 5 clusters and outliers (2 oval clusters, 2 linear clusters and 1 compact cluster).
In such conditions, the k-means algorithm has problems identifying these clusters. Below is an example of code that runs the k-means algorithm on the multishapes data set.
library(factoextra)
data("multishapes")
df <- multishapes[, 1:2]
set.seed(123)
km.res <- kmeans(df, 5, nstart = 25)
fviz_cluster(km.res, df, geom = "point",
ellipse= FALSE,
show.clust.cent = FALSE,
palette = "jco",
ggtheme = theme_classic()
)

The plot above shows that the k-means method inaccurately identifies the 5 clusters.
A density-based cluster is defined as a group of density-connected points. The algorithm follows these steps:

1. For each point, compute its neighborhood within a radius eps; points whose neighborhood contains at least MinPts points are core points.
2. Core points that lie within eps of each other are connected into the same cluster.
3. Non-core points that lie within eps of a core point (border points) are assigned to that core point's cluster.
4. Points that are neither core nor border points are labeled as noise (outliers).
To visualize DBSCAN using multishapes data set:
# Load the data
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]
# Compute DBSCAN using fpc package
library("fpc")
set.seed(123)
db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)
# Plot DBSCAN results
library("factoextra")
fviz_cluster(db, data = df,
stand = FALSE,
ellipse = FALSE,
show.clust.cent = FALSE,
geom = "point",
palette = "jco",
ggtheme = theme_classic()
)

Compared to the k-means algorithm, DBSCAN performs better at identifying the correct clusters.
Displaying the results with print(db):
## dbscan Pts=1100 MinPts=5 eps=0.15
## 0 1 2 3 4 5
## border 31 24 1 5 7 1
## seed 0 386 404 99 92 50
## total 31 410 405 104 99 51
In the table above, the column names are cluster numbers; cluster 0 corresponds to outliers (black points in the DBSCAN plot). The function print.dbscan() shows, for each cluster, how many points are seed (core) points and how many are border points.
# Cluster membership. Noise/outlier observations are coded as 0
# A random subset is shown
db$cluster[sample(1:1089, 20)]
## [1] 2 2 1 2 1 4 1 2 2 2 0 4 1 1 3 1 2 1 4 2
In the R code above, we used eps = 0.15 and MinPts = 5. One limitation of DBSCAN is that it is sensitive to the choice of eps, in particular if clusters have different densities. If eps is too small, sparser clusters will be identified as noise. If eps is too large, denser clusters may be merged together. This implies that, if there are clusters with different local densities, then a single eps value may not be enough.
The idea is to calculate the average of the distances between every point and its k nearest neighbors. The value k is specified by the user and corresponds to MinPts.
The aim is to find an optimal eps by plotting the k-distances in ascending order and locating the "knee" in the plot.
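A common way to draw this plot (a sketch, assuming the dbscan package is available in addition to factoextra) is kNNdistplot(), with a horizontal line at the candidate eps:

```r
library(dbscan)
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]
# Plot the sorted distances to each point's 5th nearest neighbor;
# the "knee" suggests a good eps (k corresponds to MinPts = 5 here)
dbscan::kNNdistplot(df, k = 5)
abline(h = 0.15, lty = 2)
```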
The optimal eps value is around 0.15.